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Abstract 

Lower urinary tract symptoms (LUTS) can indicate the presence of urinary 
tract infection (UTI), a condition that if it becomes chronic requires expensive 
and time consuming care as well as leading to reduced quality of life. Detecting 
the presence and gravity of an infection from the earliest symptoms is then highly 
valuable. Typically, white blood cell count (WBC) measured in a sample of urine 
is used to assess UTI. We consider clinical data from 1341 patients at their first 
visit in which UTI (i.e. WBC> 1) is diagnosed. In addition, for each patient, a 
clinical profile of 34 symptoms was recorded. In this paper we propose a Bayesian 
nonparametric regression model based on the Dirichlet Process (DP) prior aimed at 
providing the clinicians with a meaningful clustering of the patients based on both 
the WBC (response variable) and possible patterns within the symptoms profiles 
(covariates). This is achieved by assuming a probability model for the symptoms as 
well as for the response variable. To identify the symptoms most associated to UTI, 
we specify a spike and slab base measure for the regression coefficients: this induces 
dependence of symptoms selection on cluster assignment. Posterior inference is 
performed through Markov Chain Monte Carlo methods. 

Keywords. Bayesian nonparametrics, clustering, variable selection, Dirichlet pro¬ 
cess, spike and slab priors 


1 Introduction 

In medical settings, individual level data are often collected for the relevant subjects on 
a variety of variables; these typically include background characteristics {e.g. sex, age, 
social circumstances) as well as information directly related to the interventions being 
applied {e.g. clinical measurements such as blood pressure or the results of a particular 
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test). This set up applies for both experimental and observational studies — perhaps 
even more so in the latter case, when data are often (although not always) collected 
using registries or administrative databases. 

Arguably, the most common use of such data involves some form of regression analysis 
where the main “outcome” variable is related to (some of) the covariates (or ‘^profiles’’') that 
have been collected. More specihcally, clinicians may be interested in identifying suitable 
subgroups of patients presenting similar features; this categorisation can be used, for 
example, to suitably apply the optimal treatment for the (sub)population that will beneht 
the most. Alternatively, the focus may be on hnding the covariates that best describe 
the variation in the outcome, for example in order to determine which symptoms should 
be measured to better characterise the chance that a new (as yet unobserved) patient is 
affected by a particular disease. The hrst of these tasks can be framed in the broader 
statistical problem of clustering, while the second one is an example of model selection 
(also called variable selection). 

More interestingly, because complex and heterogeneous data are increasingly often 
collected and used for the analysis, a further connection between clustering and model 
selection can be considered, i.e. that one (set of) covariate(s) may be relevant in explain¬ 
ing the outcome variable for a subset of subjects, but not for others. In other words, the 
two tasks can be mixed in a more comprehensive analysis strategy to produce cluster- 
specihc model selection. 

For example, the dataset motivating this work contains records of Lower Urinary Tract 
Sympthoms (LUTS), a group of symptoms indicating dysfunctions of the lower urinary 
tract, i.e. bladder and urethra, including incontinence and dysuria. This symptoms 
are related to several diagnosis: from anxiety, to multiple sclerosis or bladder tumour. 
However, the most common diagnosis associated with LUTS is Urinary Tract Infection 
(UTI). To examine possible UTI, samples of urine are analysed counting the number of 
White Blood Cells (WBC). The presence of WBC in urines, even in very small quantities, 
reveals UTI [1, 2] and very often large number of WBC are associated with high degree of 
inflammation. The database includes LUTS and WBC for a number of patients affected 
by UTI {i.e. WBC> 1). For each individual, the set of LUTS constitute the patient’s 
prohle, while WBC can be considered as an indicator of UTI, the actual condition being 
investigated; the clinical objective is to assess the potential relationship between the 
symptoms and the infection. 

A standard approach to deal with these problems is to employ generalised linear mod¬ 
els, including random effects (usually modelled using a Normal distribution) to account 
for heterogeneity between patients. This is clearly a restrictive assumption in many ap¬ 
plications as often the distribution of the random effects is non-Normal, multi-modal, or 
perhaps skewed. In our analysis, we move beyond the traditional parametric hierarchical 
models, in order to account for the known patient heterogeneity that cannot be described 
in a simple parametric model. This heterogeneity is a common feature of many biomedical 
data and assuming a parametric distribution or mis-specifying the underlying distribu¬ 
tion would impose unreasonable constraints; this in turn may produce poor estimates 
of parameters of interest. It is therefore important to use non-parametric approaches to 
allow random effects to be drawn from a sufficiently large class of distributions. That is 
the modelling strategy we adopt in this paper. 

In order to take into account the heterogeneity among the patients, it is convenient to 
study the relationship between the covariates and the response within groups of patients 
having similar symptoms prohles and similar levels of WBC {i.e. in a clustering setting). 


2 


In addition, it is crucial to evaluate which symptoms are explanatory of the level of 
WBC within each group (i.e. in a variable selection setting). The goal is to develop a 
method for assessing the relationship between a response variable (in our case WBC) and 
a set of covariates (the prohle) within clusters of patients with similar characteristics, 
in order to make prediction about the response for a new patients. This will ultimately 
provide valuable information on the mechanisms of action of the underlying disease being 
investigated. 

To this aim, we develop a modelling strategy based on Bayesian non-parametric meth¬ 
ods that allows us to accomplish both tasks at once. We propose a (potentially inhnite) 
mixture of regression models to link the response with the covariates, where also the 
weights of the mixture can depend on the covariates. In this way, observations will be 
clustered based on the information contained in both the clinical prohles and the out¬ 
come variable. Within each cluster, variable selection is achieved employing spike and 
slab prior distributions that assign positive probability to the regression coefficients being 
equal to zero. The Bayesian framework allows us to perform both tasks simultaneously 
in a probabilistically sound way, so that clustering and variable selection inform each 
other. The results of the application on LUTS data show that our formulation leads to 
improved predictions, in comparison to other existing methods. 

The rest of the work is organised as follows. In Sections 2 and 3 we briefly describe 
clustering and variable selection problems from a Bayesian perspective and we review 
the most relevant literature. Then in Section 4 we introduce the details of our proposed 
approach and briefly explain how to perform posterior inference while in Section 5 we 
show how to summarise posterior inference output in a meaningful way. In Section 6 we 
present an application of our model to the LUTS dataset mentioned above. Finally, in 
Section 7 we discuss our results and draw conclusions. 


2 Clustering via Bayesian non-parametrics methods 

Suppose we wish to investigate the clustering structure characterising a dataset y = 
(?/i,..., Un)- In this case we may assume that the observations come from K distinct clus¬ 
ters, each characterised by a parameter 0^. (fc = 1,..., iF), which specihes the probability 
distribution of the data in each cluster (note that, in general, 9k can itself be a vector). 
This implies that the probability distribution of each observation j/j is a mixture: 

K 

^^iJkPivilOk), (1) 

k=l 

where 6 = {6i ,..., 6k ) and ^J^ = {'ipi,... jiPk), with ipk indicating the probability that 
unit i belongs in cluster k. 

This approach is often referred to as “model-based clustering” and its aims include the 
estimation of 'ijjk and 6k, as well as of the number K, which is often the most challenging 
task. There are two popular approaches in the Bayesian literature to estimate the number 
of clusters: the hrst one consists in htting separate models for different values of K and 
then applying a model selection criterion (such as AIC, BIG or DIG) to identify the 
specihcation with most support from the data. The second one is fully grounded in a 
Bayesian framework and amounts to specifying a prior on K and then performing full 
posterior inference on all the model parameters. The latter approach is adopted in this 
paper by employing a non-parametric prior to allow for more flexibility. 
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The model in (1) can be extended to an infinite mixture model 

OO 

k=l 

— notice that although the structure in (2) allows for an inhnite number of clusters a 
priori, by necessity in a dataset made by n observations there can be at most n distinct 
groups. Within the Bayesian framework, an elegant and efficient way of dehning this 
model is to select a Dirichlet Process (DP) prior [3]. 

A DP is a stochastic process whose realisations are probability distributions such that 
if a random distribution Q ~ DP(a,^o) then it is almost surely discrete [3, 4, 5]. Qq 
denotes the base measure (representing some sort of baseline distributions around which 
Q is centred) and a G M"*" is the precision parameter. DP models are by far the most 
widely used non-parametric Bayesian model, mainly because of computational simplicity; 
this derives from the fact that the complexity in simulating from the relevant posterior 
distributions is essentially dimension-independent. 

A “constructive” dehnition of a DP is given in [6] using the following representation: 

OO 

Q = 

k=l 

where 5e^ is the Dirac measure taking value 1 in correspondence of 9^ and 0 otherwise. 
The inhnite set of parameters 61 , 62 ,... is drawn independently from the continuous 
distribution Qq and the weights " 01 , '^ 2 , • • • are constructed using a stick breaking procedure 

[ 7 ]: 

k-l 

JJ(1 -0j) 

i=i 

with each fj ~ Beta(l,a). The almost sure discreteness of the random distribution Q is 
particularly relevant in the case of clustering, since Q will provide the weights and the 
locations for the mixture in (2), leading to a Dirichlet Process Mixture (DPM) [8], which 
can be rewritten in a compact way as 

y \0 ~ p{y\ 0 ) 

G\g ^ G (3) 

G ~ DP(q;,^o)- 

For example, if p{y \ 9) is taken to be a Normal distribution with 9 = {y,, cr'^), then the 

distribution of y becomes an inhnite mixture of Normal kernels with weights determined 
by the DP prior, i.e. y ~ Y^'^=ii’k Normal(t/ | 

Given its discreteness, setting a DP prior on the parameter vector 9 implies a non-zero 
probability that two or more of its elements are equal. This, in turn, implies that the 
DP imposes a clustering structure on the data so that the observations will be grouped 
together in k < n clusters, each characterised by a specihc distribution. The parameter 
10 includes the prior probabilities of belonging in each cluster and 6 k denotes the cluster- 
specihc parameter. The advantage of this strategy is that the number of components 
k is also learned from the data through the posterior distribution. Thus, the vector of 
individual-level parameters 61 ,... , 6 n reduces to the vector of unique values 61,... ,61 
assigned to the n observations. 
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This is also evident integrating out Q from the joint distribution of {6i,. 
writing the conditional prior for each 6 i as in [9]: 


e,, I 6 


a 


[i] 


a-\- n — Y 


^0 H - 7 - 7 ^ ^e^l 

a + n — 1 * 


., 6 n) and 

(4) 




where 0(i) is obtained removing the f-th component from (0i,..., 6*„). The discrete part 
in (4) induces ties among the components of (^i,... ,9n)- In particular, di will take an 
already observed value within with probability l/(Q! + n — 1), while a new value (from 
Qo) with probability a/{a + n — Y). 

The latter links the DPM model of (3) with a Random Partition Model (RPM), he. 
a probabilistic model over partitions of the n observations, which are imposed by the 
DP prior. We denote with the partition of the n observations in k clusters, each 
characterised by the values of 9* for j = 1 ,..., fc. A membership vector s = (si,..., s^), 
describes which of the 0*’s is associated with each observation. Thus, s* takes on the 
values {1,2,..., fc} for i = 1,..., n. From (4) and assuming 9i as the last value to be 
sampled (this holds for every 9i since (4) is exchangeable [9]), the prior distribution over 
all the possible partitions of n observations implied by a DP is 


k 

p{pn) oc ]^a(nj - 1)! 
j=i 


(5) 


where rii,... ,nk is the cardinality of each cluster [10]. Given the partition, in model (3) 
the observations assigned to different clusters are modelled as independent. 


3 Extension of the DP prior in regression settings 

Many extension of the conventional DPM have been proposed in the literature. We 
describe here only the two relevant for our application. In particular, we work in a linear 
regression framework considering a vector of observations y = {yi,..., yn), for which we 
assume the model 

yi I Xi, f3i, Xi ~ Normal(|/j | Xif3i, A*), (6) 

where Xi is the z—th row of the {n x D) covariates matrix X, f3i is a vector of subject- 
specific regression coefficients and Aj is the individual-level precision. 

The coefficients /3j are usually modelled using a multivariate Normal prior; however, 
while it is important to maintain a simple interpretation and guarantee easy computation 
for the posteriors, we would like to accommodate heterogeneity in the population and to 
allow for outliers, clustering and over-dispersion. This higher level of flexibility is often 
difficult to achieve using a single parametric random effect distribution. In addition, 
we aim at identifying the explanatory variables that influence the outcome the most. 
Thus, we extend the basic Bayesian regression framework and set a non-parametric prior 
instead. 

3.1 Covariate dependent clustering 

Recent developments in the RPM literature (particularly within a regression framework) 
have focussed on creating partitions of the observations that successfully take into account 
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possible patterns within profiles determined by the combination of covariate values. This 
is particularly relevant for applications involving a large number of covariates that are 
expected to contain useful but correlated information about clustering and when the 
focus is on prediction. This type of models has been termed random partition model with 
covariates (RPMx) [11]. 

In order to allow the partition of the observations to depend on the values of the 
covariates x, several authors have proposed modifications of the basic clustering structure 
of the DP prior described in (5). For example, [12] propose to model the response 
and covariate(s) jointly using a DPM model [8] and then use the posterior conditional 
distribution of the response given the covariates and the partition of the observations to 
determine the relationship between y and x. An alternative model is the so-called product 
partition model with covariates (PPMx) [13]. In this case a prior is specified directly on 
the partition: 

k 

V{.Pn) OC 

i=i 

where c(-) is a “cohesion” function and Sj is the set containing the observation labels 
belonging to cluster j — note that the DPM is a particular case, induced by choosing 
c{Sj) OC o.{nj — 1)!. The PPMx extends the cohesion function to include a measure of 
similarity between covariates from different observations so that individuals with “similar” 
profiles are more likely to be grouped in the same cluster. 

Other extensions of DPM models which allow for covariate dependent clustering are 
given by the DP with Generalized Linear Model (DP-GLM) [14], which accommodates 
different types of responses in a GLM framework; the Generalized Product Partition 
Model (GPPM) [15], which first clusters the individuals on the basis of the covariates 
and then uses the posterior probability of the partitions obtained as prior for the partitions 
built on the response; and the Profile Regression (PR) [16], in which a DPM model is 
specified for the covariates, which are then linked to the response through cluster-specific 
regression models. 

3.2 Variable (model) selection 

The DP framework can be also used to perform variable selection in the linear model 
(6). If we assume a DP process prior for the regression coefficients /3i, then we obtain an 
infinite mixture of linear regression models. This implies a partition of the n observations 
in clusters, each of which is characterised by specific values of the regression coefficients 
{i.e. in the j-th cluster, (3* of length equal to the number of covariates, D). Thus, if we 
allow the parameter (3* to have some component (s) equal to zero with positive probability, 
we are effectively allowing for the possibility that some of the observed covariates in cluster 
j, denoted x*, do not affect the outcome y for the observations in that cluster, effectively 
performing variable selection. 

For example, [17] propose a Bayesian non-parametric regression model employing a 
DP prior on the regression coefficients, but choosing as base measure a spike and slab 
distribution, to allow some coefficients of the regression to be zero (see [18] and [19] for 
a review on spike and slab distribution for variable selection). Similar approaches can be 
found in [20] for linear mixed models and in [21] for binary regression. These methods 
perform covariate selection within each cluster and, as a consequence, observations are 
grouped on the basis of the effects of the covariates on the response. This strategy can 
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lead to poor clustering in the case of a high-dimensional covariate space, and also to 
reduced predictive performance. This is because possible patterns within the covariates 
are not taken into account when clustering individuals. 

Variable selection techniques have been developed also for the PPMx model [13] and 
for the PR model [22]. In this setup, it is possible to perform simultaneously covariate 
dependent clustering and selection of covariates that discriminate the most between clus¬ 
ters; however, no variable selection is performed to identify the covariates that mostly 
associated with the response of interest. 

Other examples of joint clustering and variable selection can be found in [23, 24, 25, 26] 
among the others, but once again these methods aim to highlight covariates that most 
explain the clustering structure, but not the relationship between covariates and outcome. 

The main goal of this paper is to combine covariate dependent clustering and variable 
selection methods able to identify covariates that best explain the outcome variable, by 
generalising the approach in [17]. We refer to the proposed model as Random Partition 
Model with covariate Selection (RPMS). 

4 Random Partition Model with Covaxiate Selection 
(RPMS) 

In this section we develop the RPMS model and briefly explain the Markov Chain Monte 
Carlo (MCMC) algorithm employed to perform posterior inference. 

4.1 Regression Model 

As discussed at the beginning of Section 3, we use a linear regression model to explain 
the relationship between the response and the covariates. Let y = (t/i,..., t/„) denote the 
response variable. Then, we assume 

Vi I Xi,f3i,Xi ~ Normal(t/i | Xif3i,Xi). 

Here, we assume that Xid G {0,1} for every i = 1,... ,n and d = 1,... ,D. Thus, we 
can interpret X as the matrix containing the information about the presence of D symp¬ 
toms for each of the n patients; these symptoms are assumed to possibly have an effect on 
the response y. We focus on binary covariates, because in clinical settings symptoms are 
often recorded as binary indicators (in fact, that is the case in our motivating example). 
Extensions to other type of covariates is however trivial. 

The goal is to specify a prior structure that allows detecting a possible clustering 
structure based on symptoms prohles and then identifying which variables most influence 
(globally or in some clusters) the response variable. 

4.2 Model on the Covariates and Prior Specification 

To allow for covariate dependent clustering, we exploit ideas in [12] assuming a probability 
model for the vectors of covariates: 


D 

cci I Cl, • • •, Cd ~ n Bernoulli(xid | Od)- 

d=l 
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In addition, we specify a joint DP prior distribution on Q = {Qi,..., (nj) and /3j = 
(Al; • ■ ■ ; Ad) : 

(/3l,Cl),...,(/3n,Cn) 1^ ~ ^ (7) 

Q ~ DP(a,^o) 

where a is the precision parameter and Qo is the base measure of the process. Recalling 
Section 1, the DP in (7) assigns a positive probability for two observations i and to have 
the same values (A, A) = (A') A')- W^e denote with (/3*, <^*) = ((Z?!, Ci), ■ ■ ■, Cl)) 
unique values for the parameters. This construction implies that observations are clus¬ 
tered on the basis of both their covariates profile and the relationship between covariates 
and responses. 

The model is completed by specifying a conjugate Gamma prior on the regression 
precision assuming Ai = ... = = A and modelling A ~ Gamma(A | ax,bx), as well 

as using a Gamma hyper-prior on the concentration parameter of the DP [27]; a ~ 
Gamma(aQ,, 6 ci,). These are common prior choices as they enable easier computations. 

4.2.1 The Spike and Slab Base Measure 

The choice of the base measure of the DP is crucial. We assume that A cind the A 
are independent in the base measures. We choose a spike and slab distribution as base 
measure for the regression coefficients to perform variable selection. Thus we define: 

D 

5o=n{Mo (Ad) + (1 - (^d) Normal(Ad | rud, Td)] Beta(Cd | ac, A)}) 

d=l 

which is simply the product measure on the space of the regression coefficients and of the 
parameters defining the distribution of the covariates. The notation Ad and C-d highlights 
the fact that the base measure is assumed to be the same across the observations. In Qq, 
the first part within the square brackets is the spike and slab distribution, while A (/3-d) 
is a Dirac measure that assigns probability 1 to the value 0. Thus a spike and slab 
distribution is a mixture of a point mass at 0 (in correspondence of which. Ad = 0) and 
a Normal distribution, with weights given by ujd and (1 — cj^), respectively. 

A conjugate base measure is employed for the covariate specific parameters Cd for 
ease of computations. We assume the same hyperpriors for the parameters in Qq as in 
[17]. In particular we set a spike and slab hyperpriors for each ujd- 

D 

CUi, . . . , Ud I TI"!, . . . , TTi) ~ Ipd - 7rd)(5o(i^d) + TTd Beta(a;d | a^, 6 ^;)} 

d=l 

D 

TTi, • • • 5 TT/j) I CI 71-5 ~ n Beta(7rd | A) 

d=l 

The latter solution has been proposed by [28] as the possibility to induce extra sparsity 
on the regression coefficients, encouraging those associated with the covariates having no 
effect on the response variable to shrink toward zero. As shown in [17], it is possible to 
integrate out Ud from the base measure, obtaining: 

D 

Go = ]^{[ 7 rdWa;A(Ad) + (1 - 7 rdtCa;)Normal(Ad I Tnd,Td)]Beta{Cd \ a^,K)} 

d=l 
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where = j^y 

We set mi = ... = = 0 ; in addition, we use a a Gamma prior for the precision 

parameters of the Normal component of the spike and slab prior: 


D 

Ti, . . . , Tjd I Clr: ^ n Gamma(rrf | aryb^). 

d=l 


4.3 Posterior Inference 

Markov Ghain Monte Garlo (MGMG) [29] algorithms have been adopted to sample from 
the posterior distributions of interest. Since our model can be rewritten using a DPM 
formulation on the response and the covariates jointly, efficient Gibbs sampler schemes are 
available. We follow the auxiliary parameter algorithm proposed in [30]. This procedure 
updates first the vector of cluster allocations s and then separately all the cluster specific 
parameters and the parameters that do not depend on the cluster allocation. A detailed 
description of the updating steps is reported in Appendix A. We present below a summary 
of the updating steps: 

(i) Update the membership indicator s = (si,..., Sn) using the Gibbs sampling proce¬ 
dure for non-conjugate base measure by the auxiliary variable algorithm presented 
in [30]. 

(ii) Update the precision of the DP, a, exploiting the method implemented in [27], 
setting a ~ Gamma(a | Oq,, ba) a priori. 

(iii) Update ..., <^^) from the full conditional distribution, given the new 

configuration of s in (i). 

(iv) Update f3* = (/3j‘,..., f3l) from the full conditional posterior distribution, given the 
new configuration of s in (i). 

(v) Update tt = (tti, ... , 110 ) from the full conditional distribution. To draw from this 
distribution we implement the algorithm described in [17]. 

(vi) Update r = (ri,... ,td) from the full conditional distribution. 

(vii) Update the precision of the regression A from the full conditional distribution. 


5 Summarizing Posterior Output 

The choice of a spike and slab base measure implies that the coefficients have positive 
probability to be equal to zero. We propose here two ways of summarising the MGMG 
output that highlight the effect of using a spike and slab prior distribution. These two 
methods are then applied to the real data example in the following section. 

In our framework a covariate can be explanatory for a cluster and not for another. 
Thus, a first method to analyse the results would be to compute the probability that the 
d—th covariate has explanatory power in cluster j, i.e. 7 ^ 0 ), given a partition of 

the observations in clusters. The literature proposes a variety of methods for extracting 
a meaningful partition from the MGMG output [31, 16]. In our application we have 
decided to report the partition obtained by minimizing the Binder loss function [32]. 


9 



Then, conditioning on the selected partition, we can compute the posterior distribution 
of the regression coefficients for each cluster, together with the probability of inclusion of 
a certain covariate, i.e. 1 — = 0 | s, ■), where s is the Binder configuration. 

A second way of summarising the posterior output from a variable selection perspec¬ 
tive is based on predictive inference. Let us consider the situation in which a new patient 
enters the study with profile x. Using the proposed approach, the posterior distribution 
of the regression coefficients depends on the structure of the patient’s profile. This is due 
to the fact that the cluster allocation depends on it. In fact, in RPMS the predictive 
distribution of the cluster allocation is: 

" D 

’^iWajdi.Xd) for j = 1,...,A; 


p(s I a;,...) oc 


d=i 

D 


(yY\_god{.Xd) 


for j = k + 1 


( 8 ) 


d=l 


where x = {xi,... ,xd) and s are the profile for the new patient and its cluster allocation, 
respectively. In addition, gjd{xd) = Cjd^i^ “ and god{xd) = (/q - 

/{Jq — u)^^ ^du) are the likelihood for the new observation to belong 

in cluster j and the prior predictive distribution of the new observation, respectively. The 
probability in (8) comes directly from the predictive scheme of the DP [4]. 

Hence, we focus on p{(3d = 0 | x,s,-). This probability can be approximated using 
the MCMC samples. Moreover, it is possible to look at the predictive distribution of the 
response, namely y, that by construction depends on the variable selection. Notice that 
the model in [17] does not specify a model on the covariates, and consequently clustering 
does not depend on the covariates profiles. 

Alternatively, we could compute the posterior probability of = ... = = 0 | ■) 

to summarize the overall importance of the d—th covariate. This posterior probability 
can be approximated empirically by calculating the proportion of iterations in the MCMC 
run in which the regression coefficient for the d—th covariate is equal to zero in all the 
clusters: = ... = = 0. 


6 Lower Urinary Tract Symptoms (LUTS) data 

In this section we present the results of the application of the RPMS to the LUTS 
database. First, we briefly describe the database, then we give details of the choice of the 
hyper-parameters and MCMC settings. We briefly introduce the competitor model and 
finally we report the results for clustering and variable selection. 

6.1 Data 

We consider data on 1341 patients extracted from the LUTS database collected at the 
LUTS clinic, Whittington Hospital Campus, University College London. For each patient, 
the presence of 34 LUTS has been recorded together with the White Blood Cell count 
(WBC) in a sample of urine. The patients are women aged over 18, affected by LUTS. 
We consider data at the first attendance visit. 

It is of clinical interest to investigate the relationship between LUTS and Urinary Tract 
Infection (UTI), where the latter is measured by the number of WBC. In particular, a 
value of WBC > 1 is indicative of the presence of infection. 
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The symptoms are stored as binary variables (1 indicates the presence of the symptoms 
and 0 the absence). We report the freqnency distribntion of the symptoms in the 1341 
patients in Table 6.1. The symptoms can be gronped into fonr categories: urgency 
symptoms (symptoms from 1 to 8), stress incontinence symptoms (9 to 14), voiding 
symptoms (15 to 21) and pain symptoms (22 to 34). 


Table 1: Lists of the 34 symptoms with the frequency of occurrence. 


Symptom 

Frequency 

Symptom 

Frequency 

1) Urgency incontinence 

0.4146 

18) Straining to void 

0.0828 

2) Latchkey urgency 

0.4280 

19) Terminal dribbling 

0.1641 

3) Latchkey incontinence 

0.2304 

20) Post void dribbling 

0.0820 

4) Waking urgency 

0.5496 

21) Double voiding 

0.1193 

5) Waking incontinence 

0.2595 

22) Suprapubic pain 

0.1611 

6) Running water urgency 

0.2901 

23) Filling bladder pain 

0.2148 

7) Running water incontinence 

0.1365 

24) Voiding bladder pain 

0.0567 

8) Premenstrual aggravation 

0.0515 

25) Post void bladder pain 

0.0723 

9) Exercise incontinence 

0.1462 

26) Pain fully relieved by voiding 

0.0634 

10) Laughing incontinence 

0.1536 

27) Pain partially relieved by voiding 

0.1260 

11) Passive incontinence 

0.0783 

28) Pain unrelieved by voiding 

0.0164 

12) Positional incontinence 

0.0850 

29) Loin pain 

0.2081 

13) Standing incontinence 

0.0895 

30) Iliac fossa pain 

0.0895 

14) Lifting incontinence 

0.1104 

31) Pain radiating to genitals 

0.0865 

15) Hesitancy 

0.1797 

32) Pain radiating to legs 

0.0649 

16) Reduced stream 

0.1909 

33) Dysuria 

0.1484 

17) Intermittent stream 

0.1514 

34) Urethral pain 

0.0507 


In this paper we focus only focus on patients with UTI (WBC> 1). We consider a 
logarithmic transformation of the WBC data and model the log-transformed WBC using 
a Normal distribution. The left panel in Figure 6 displays the kernel density estimation 
of the log-transformed WBC. 


6.2 Prior Specification 

The hyperparameters of the spike and slab prior in the base measure are set as follows: 
Oo; = Ott = 1, = K = 0.15, Qr = br = 1 and = 1. We note here that we set 

vague prior beliefs on the distribution of the parameters, except for the prior on tt^ and 
bJd to make computations more stable. The hyperparameters a\, b\ for the precision A in 
the regression density are set both to be equal to 1. Finally, for the prior on concentration 
parameter of the DP we use = &« = 1. 

We run the MCMC sampler for 10 000 iterations with a burn-in period of 1000 iter¬ 
ations. Details on the algorithm are given in Appendix A. To update the membership 
indicator, we use the auxiliary variable approach described in [30] that requires the choice 
of a tuning parameter M. In our experience, M = 100 gives a good trade-off between 
execution time and efficiency of the Gibbs sampler. The convergence of the chains is 
assessed by trace plots and by the Gelnian and Rubin’s convergence diagnostic [33], the 
latter for the parameters that do not depend on the cluster assignment. 


6.3 The competitor model: SSP 

In order to highlight the potential and advantages of RPMS, we compare its results with 
the model described in [17], which we believe is the closest competitor. For simplicity. 
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we refer to this model as SSP (Spike and Slab Prior). This assumes the same Normal 
specihcation for the WBC counts and a DP prior on the regression coefficients with a 
spike and slab base measure for the regression coefficients. 

The difference with our own specification consists in the fact the the SSP treats the 
covariates as given, instead of associated with a probability distribution. This implies 
that in the SSP the base measure of the DP prior reduces to: 

D 

Qq = + (1 - u}d)N{/3.d I md,Td)}. 

d=l 

We follow the same strategy adopted for the RPMS of integrating out from each 
part of the base measure the Ud- The model described in [17] involves also the use of 
a DP prior on the precision in the regression model, but for a fair comparison with the 
RPMS we use a version of the model without this further complication. Moreover, the 
results obtained by the SSP with or without the DP prior distribution on A have not 
shown to be signihcantly different. In the MCMC algorithm, we use the same initial 
values, tuning parameters and number of iterations utilised for the RPMS to obtain the 
posterior distributions. 

6.4 Clustering outputs 

The proposed method, as explained above, employs a DP prior for the regression coeffi¬ 
cients and for the parameters governing the prohle distribution. The main consequence 
is that the implied clustering is influenced by both the distribution of the y = log(WBC) 
and by possible patterns within the prohles. 

Figure 1 reports the posterior distribution for A:, i.e. the number of clusters, from 
the RPMS model. The conhguration involving 14 clusters is clearly the one with the 
highest probability. This is the hrst signihcant difference with the SSP model that has 
a clear mode at fc = 1. This is due to the fact that the SSP takes into account only the 
variability in the regression coefficients. In the RPMS the covariates contribute to inform 
the partition of the observations. 

To summarise the posterior inference on the clustering we report the partition which 
minimises the Binder loss function [32] 

L(s, s) = ^ ^ {si=s^l} + ^2^{si=s^,}I{si^s^,}) 1 (9) 

where s is a proposed partition, while s indicates the true partition. The choice of 
the constants and ^2 allows to express the preference for a small number of large 
clusters or for a large number of small clusters, respectively. In our application we 
set = ^2 = ^ penalising both terms equally. In this application the true partition 
is unknown, whereas proposed partitions are represented by draws from the posterior 
distribution of the membership indicators. 

The posterior expectation of (9) is 

E(L(s, s) I Data) = ^ | I{si=s^,} - Iw \ 

where = E(J|s_s,,} | Data) and it can be consistently estimated computing the 
empirical probability of observations i and i' to be clustered together across the observed 
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k in RPMS k in SSP 


Figure 1: Posterior distribution of the number of clusters k for RPMS and for SSP models. 


MCMC iterations. Minimising the Binder loss function in our case leads to a configuration 
with 14 clusters, with the 9 largest clusters containing 92.5% of the observations. 

Figure 2 displays the presence of the 34 symptoms (on the columns) for the patients 
assigned to each of the nine largest clusters. Recalling that the symptoms can be grouped 
into four main categories, i.e. urgency, stress incontinence, voiding and pain symptoms, 
we can see that the largest cluster contains patients with a small number of symptoms 
belonging to all the four categories, or with no symptoms at all. In the second largest 
cluster, almost all patients present the fourth class of symptoms and almost none the third 
one. A high frequency of the other urgency symptoms is also evident. The third largest 
cluster includes patients with a high frequency of pain symptoms together with urgency 
symptoms (even though with a lower probability). The fourth cluster is characterised by 
a high frequency of urgency symptoms; the fifth cluster by a high frequency of voiding 
symptoms; the sixth cluster by a high frequency of incontinence symptoms; the seventh 
cluster by a high frequency of urgency and incontinence symptoms; the eighth cluster by 
a high frequency of urgency and pain symptoms and the ninth cluster by a high frequency 
of urgency and voiding symptoms. 

This distribution of the symptoms across the Binder configuration suggests that the 
symptoms classes are informative for the partition. Consequently, it is likely that each 
combination of symptoms has a particular effect on the WBC counts distribution; this 
is because cluster specific regression coefficients are associated to cluster specific proba¬ 
bilities of having the symptoms. 

6.5 Variable selection outputs 

Our proposed model performs simultaneously clustering and variable selection. It is 
of clinical interest to check which symptoms have a significant impact on the response 
variable. In our case, this means checking which symptoms are more likely to be predictive 
of an underlying infection. 

In this section we will use the two ways of summarising the variable selection infor- 
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Figure 2; Symptom indicators (black) for the 9 biggest clusters of the partition obtained 
by minimizing the Binder loss function. The horizontal axis of each panel (corresponding 
to a cluster) displays the index of the symptoms, whereas the index of the patients in 
each cluster is on the vertical axis. For each cluster the cardinality is also displayed. 


mation produced by the RPMS that have been described in section 5. The hrst one is 
based on the Binder estimate of the clustering conhguration, while the second focusses 
on the predictive distribution for a new patient. We hrst report the posterior probability 
of each symptoms to be included in the model, conditional on the Binder estimate of the 
clustering conhguration. 

Figure 3 displays the probability of inclusion, i.e. 1 = 0 | s, •) for the 9 largest 

clusters according to the Binder estimate ordered by size. For example, let us consider 
the hfth row (which refers to the hfth cluster). From Figure 2, we see that this cluster 
contains mainly the symptoms from 15 to 22 (cfr. the list in Table 6.1). Consequently, 
in Figure 3 the probability that symptoms 15, 16, 17, 19 are included in the regression 
model is close to 1. On the contrary, for symptoms 18, 20, 21 and 22 the probability 
of being included is low. Figure 3 also suggests the importance of the symptoms in the 
urgency class and of dysuria and loin pain within the pain class. 
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Figure 3: Probability of inclusion, i.e. (3 ^ 0, for each symptoms in the 9 biggest clusters 
of the partition estimated by minimizing the Binder loss function. 


The second way of presenting variable selection results is by considering predictive 
inference. Recall that for a new patient entering the study the distribution of the regres¬ 
sion coefficients depends on his profile, i.e. x, through the cluster assignment. This does 
not happen in the SSP, in which the predictive probability for the cluster assignment 
depends exclusively on the cardinality of the clusters. 

To illustrate the last considerations, we take x include the presence of symptoms 1, 2 
and 4 from Table 6.1. Figure 4 shows the density estimation of the posterior distribution 
of the regression coefficients related to the three symptoms in x. We present the output 
from both the RPMS and the SSP. The evident differences between the distributions are 
due to the fact that in the SSP the regression coefficients do not depend on the profile, 
which they do in the RPMS. Consequently, in the SSP the posterior distribution of the 
regression coefficients is the same for every (new) patient, while in the RPMS it can vary, 
depending on the covariates prohle. Moreover, in the RPMS the spike and slab prior 
distribution can be seen as a within-cluster prior. 

The different posterior distributions of the regression coefficients for the SSP and the 
RPMS have obviously an impact on the predictive distribution of y. Figure 5 displays the 
predictive distribution of the response given a profile x (we assume these are the same 
as those considered in Figure 4) and for a different profile x', which is characterized by 
a large number of symptoms: Xi = X 2 = = Xq = X 7 = X 22 = X 2 z = a ^27 = 

3^28 = 3132 = 3133 = 1 . 

In the first case, the distributions obtained from the SSP and the RPMS have similar 
means, but the one estimated by the RPMS seems to have smaller variance. On the other 
hand for x' the predictive distributions seem more substantially different. 

In order to determine whether the proposed model leads to improved predictions we 
employ the Brier statistic [34] . This statistic assesses the quality of predictions when the 
response variable is binary: 

1 

Brier = - ^(/i - o^)^ 
n 

i=\ 

where Oi is a binary observation and /* is its predicted probability. 

In our case the response variable is continuous, so in order to apply this statistic we 
discretize the response variable. In this case we are interested in predicting whether the 
WBC count of an individual is above or below a specific threshold. We use the quartiles 
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Figure 4: Kernel density estimation of the posterior distribution for /3i, (32 and (3^ given 
the new profile x with xi = X 2 = X 4 = 1. The first row refers to the RPMS model, 
while the second row refers to the SSP model. For SSP the posterior distribution for the 
regression coefficients does not depend on x. 


of the observed WBC counts as thresholds. Then, the predicted /j will be the probability 
of obtaining values larger than the specified threshold. For each patient i in the sample, 
we estimate /* from the predictive distribution. Hence, the Brier statistic becomes: 

i=l 

where is the i—th response discretised with respect to the g—th quartile, whilst 
is the predictive probability that an individual has a WBC value larger than the g—th 
quartile. The sum is taken over all the patients in the sample. 

We compare the posterior distribution of Brier(g) estimated by the SSP and the RPMS. 
Small values of the Brier statistic indicate good classification performance. The posterior 
distribution for the SSP and the RPMS are displayed in Figure 6 , together with the kernel 
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Figure 5: Kernel density estimation of the predictive distribution of y given profile x 
with xi = X2 = X4 = 1 and of y' given prohle x' with xi = X2 = x^ = x^ = x^ = x^ = 
Xj = X22 = X23 = X 27 = X28 = X32 = X33 = 1 


estimate of the response variable. We consider the hrst quartile, the median and the third 
quartile to discretise our response. In all three cases the location of the distribution of 
Brier(g) is lower for the RPMS model, indicating better predictive power compared to 
SSP. 



Figure 6; Left panel; kernel density estimate of the response variable log(WBC). The 
vertical lines correspond to the quartiles. Right panel: posterior distributions of the 
Brier statistic using as threshold the hrst quartile, the second quartile (median) and 
third quartile. Results are shown for SSP and RPMS models. 


Evidence in favour of the hypothesis that WBC > 1 indicates the presence of UTI is 
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given in [1], This recent resnlt extends the analogous one of [2], where WBC > 10 was 
considered. All patients in our study have WBC > 1 and thus it is very likely they are 
affected by UTI. However, if on the one hand a large number of WBC in a sample of 
urine will increase the conhdence about the presence of UTI, on the other hand no work 
has been published that describes the severity of infection in relation to higher values 
of WBC. Nevertheless, specialists consider fully reasonable to associate high degree of 
inflammation to large values of WBC. 

Hence, discretising the response variable to make prediction is reasonable both to 
assess the likelihood of having an infection and to evaluate the general status of the 
disease (in terms, for example, of the degree of inflammation). Moreover, discretising the 
response variable transforms the problem into a classihcation one, which links our model 
to the very important area of risk prediction models. A review of these models can be 
found in [35], who highlight the most common techniques to perform model selection in 
this context. 


7 Conclusions 

In this work we have proposed the RPMS, a DPM of Normal regressions with covariate 
dependent weights, capable of simultaneously performing clustering and variable selec¬ 
tion. This is achieved employing a DPM on the joint distribution of the response and the 
covariates, together with spike and slab prior distributions on the regression coefficients 
within the clusters of the partition. The latter allows performing variable selection and 
therefore identifying covariates with high explanatory power on the response. The pro¬ 
posed method is designed to handle binary covariates, due the dataset motivating this 
work, even though it is straightforward to include other types of covariates (and also 
mixed types). Although we have presented the model for Normally distributed response 
data, it is possible to extend it to the generalised linear model framework. 

The main feature of the model lies in the fact that the RPMS takes into account also 
possible patterns within the covariate space. The results of the analysis highlight the 
diagnostic power of the symptoms. On the other hand, the SSP focuses on the variability 
within the response. The results of the posterior inference show that in the partition 
generated by the RPMS the clusters are characterized by the presence of certain classes 
of symptoms or combinations of classes, revealing that these classes are informative and 
further investigation might lead to the identihcation of disease sub-types. 

The results of the variable selection has been summarized in two ways: hxing a mean¬ 
ingful partition (we have opted for the Binder estimate) or hxing a specihc prohle in a 
predictive fashion. In the hrst case, the analysis of the posterior distribution of the regres¬ 
sion coefficients conditional to the Binder partition shows the overall importance of the 
urgency symptoms, and the cluster-specihc importance of certain particular symptoms. 
The second way to display the variable selection output is from a predictive perspective. 
We have assumed that a new patient’s prohle has been collected. The distribution of 
the regression coefficients for the new patient depends on the prohle and this permits an 
individual-based assessment of the important symptoms that determine the distribution 
of WBC. The SSP’s estimated posterior distributions of the regression coefficients instead 
do not depend on the patient prohle. This diherence permits the RPMS to achieve more 
accurate prediction of the WBC compared to the SSP. 

We believe that the use of Bayesian non-parametric methods, although computation- 
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ally more expensive, offers the flexibility necessary to captnre the complexity of modern 
clinical data and conseqnently improved predictive power, especially in cases where the 
nse of parametric approaches wonld impose nnrealistic assnmptions on the data generat¬ 
ing process. 


Supplementary Material 

The resnlts of a simnlation stndy are presented in Snpplementary Material. 

A Posterior Inference 

In this appendix we present the details of the npdating steps of the Gibbs sampler scheme 
adopted. 

Membership Indicator 

This step follows the npdating Gibbs-type algorithm for DPM with non conjngate base 
measnre in [30] called auxiliary parameter approach. Let first define s* to be the member¬ 
ship indicator for the observation i and S(j) to be the vector of the membership indicator 
for the n observations bnt from which s* is removed. Let ns also define k~ to be the 
nnmber of clnsters when i is removed, n~ for j = 1 ,..., to be the cardinality of the 
clusters when i is removed. Thus the full conditional distribution for each indicator is: 

p{si I 

^ f njp{yi I Xi, (3*, A) Hii Pi^id \ Qd) J = 1, ■ ■ ■, 

I fiPiVi I Xi, f3*^, A) n?=i P{xid I Cd) J = A;- + 1 and m = 1 ,..., M 

where {(3^, Cm) m = 1,..., M are draws from the the base measure in (4.2.1). 

Precision of the DP 

In order to update the parameter a of the DP we need to introduce an additional param¬ 
eter u such that p{u \ k,a) = Beta(Q; -|- (see [27] for detailed explanation). We can 
sample from the full conditional: 

p{a \ u,k) = ^Gamma(aa + k,ba — log(M)) - 1 - (1 — ^)Gamma(aQ + k — l,ba — log(n)) 
where C = (oq + k — 1 )/(q;i + k — 1 + nba — 'nlog(M)). 

Covariate Parameters 

For the update of the parameters of the covariates, we work separately for each of the D 
covariates within each of the k clusters. Thus the full conditional posterior distributions 
are: 


PiCjd I x*d) oc Yl Bernoulli(a;id | Cj'd)Beta(C;rf | a^, 6c) 

i&Sj 
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Cjd \ ■ ^ Beta j Cjd I ac + X] ^id, bi; - ^Xid + nj \ , j = l,...,k and d=l,...,D 
\ *SSj i&Sj J 

Regression Coefficients 

As for the case of the parameters of the covariates, we consider separately each D and 
each of the k clusters. It follows that; 


pW;, I X,y,/3-,„) cc n Normal(?/i | Xi, f3*, \)[ndWujSo{/3*a) + {l-ndWuj)^OTmal{/3*^ \ md,Td)] 

i&Sj 


TTdWu,So{l3*^)N{yi I Xi,f3*,\) + {1 - TTdWuj)^OTma\{l3*^ \ rd)Normal(?/i | Xi,f3*,\) 


i^Sj 


(^*j(d) vector [3* where the dth component is removed. The hrst part of the 

last equation will be 0 with some probability. Let us consider the second part of that 
equation. This is proportional to 


exp {-^TdWjd - mdf^exp <j ^ X{yi - Xi(^d)(3*(^d) “ Xidl3*df J> = 

i&Si 


= exp < —; 


/3*id{rd + Xxfa) - 2/3L I mdTd + A V(a;irfA) 


jd ' 


jd 


i£Si 


with Ai = {]ji — Xi(d)l^*j(^d))i ^i(d) tf^e vector Xi where the dth is removed. 
Thus, the full conditional probabilities for the Gibbs sampler are: 


0 


w. p. 9 


/S. 


jd 


jd 


Normal 


rrid'Td+Y.ii^S ■ A^idAi) 


Td+Eies, A^ld 
Finally the weights 0 = (0i,..., 0^) are: 

9jd = 

with C'. 


Td + Eie5, (^^id) 1 W. p. (1 - Ojd) 


^dWuj 


T^dWui + (1 - T^dW^)C 


c = 




^d + I T-^exp \ -\Tdml + ^ ( + ^^(a^idA) 

ies, / ^ \ i&Si 


mdTd + '^(XxidAi) 

i&Sj 
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Weights of the Spike and Slab Prior 

Following what was done in [17] let us call = Tid.u’oj and = a^/ 


p(rd) = Beta 




1 


1 



The full conditional is the following: 


Td) 


btj — 1 


p{rd I /3rf) oc p{rd)rf' h/3‘d7^o) 

This is an unknown distribution and a draw from it is obtainable computing the 
inverse of the cumulative distribution function over a grid of values. We select the point 
on the grid that gives the closest value of the inverse cumulative distribution function to 
a draw from a uniform distribution on (0,1). 


Precision of the Base Measure 

We will update the precision of the normal part of the base measure considering separately 
each of the D covariates. 


k 

p{Td I f3*, ttr, hr) OC Gamma(rd | a^, hr) W[T^dW^5Q{(3*^) + (1 - 'KdW^)N{l3*^ \ rud, Td)] = 

1=1 


= Gamma(rrf | ar,hr)Y\_N{f3^^ \ md,Td) 

1=1 

where is the number of clusters that have non zero coefficients in position d, 
whereas for j = 1,..., nj" is the list of these non zero coefficients. Thus, it is possible 
to draw from the following known distribution: 


Td 


,ar,hr ~ Gamma 



dr + 



- rUd) 


Precision of the Regression 

The precision of the regression is updated in a conjugate form as it follows: 

k 

p{\ I y,X,f3*,ax,hx) oc nn Normal(?/j | ajj,/3*, A)Gamma(A | ax,hx) 

j=l i£Sj 

A I y, X, (3*, ax, hx ~ Gamma ( A | n/2 + ax, ^ *A) V2 + hx 

\ i=i 
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